Add node REPLACE shutdown implementation #76247
Conversation
I had a first look and have a few questions.
Other than that, my main comment is that we will be putting a lot of trust in the caller: if they get it wrong in any way, nodes will run out of disk space or things will get messed up (for instance, regular shards placed on frozen nodes). This might be ok, but it feels a bit like removing the guardrails. Will give this some more thought...
I left a few comments on the new approach.
} else if (isReplacementTarget(allocation.metadata(), node.getId())) {
    return Decision.single(Decision.Type.NO, NAME,
        "node [%s] is a node replacement target, shards cannot auto expand to be on it until the replacement is complete",
        node.getId(), "source");
} else if (isReplacementSource(allocation.metadata(), node.getId())) {
    return Decision.single(Decision.Type.NO, NAME,
        "node [%s] is being replaced, shards cannot auto expand to be on it", node.getId());
I guess it could be ok, but maybe worth discussing anyway whether we should count a replacing source/target pair as one node for auto-expand?
The benefit of contracting auto-expand is obviously that we might drop a replica on the source and thus speed up the replacement. On the other hand, we are jeopardizing data safety.
At least it means REPLACE has some "express-vacate" semantics, i.e., it should not be used for softer vacates.
I was on the fence about this too; I don't have a strong feeling either way. Do you think we should go one way or the other?
The difficulty with considering a source/target pair as one node for auto-expand is that expanding to the source node could have been invalid for other reasons while still being valid on the target node, which would put it in the "could be dangerous to add another shard to the target node" category. Hence, for safety I went with the "neither" approach for now.
More than the classical "give me a copy on every node for search performance" case, I am worried about our use of 0-1 auto-expand on system indices like the security indices.
In particular, in the replace case we risk that reducing the replica count leaves the remaining copy on the source node, waiting to be vacated. Given that we are running against time here, that adds some risk of losing the last copy of the data. To be safe, we would also have to ensure that any copy on the source/target node is the one we get rid of, which would be complicated.
Regarding your case of it being invalid on the source node, I think there are really two cases there:
- The shard already exists on the source node (i.e., the node was excluded somehow but the shard has not yet moved off).
- The shard does not exist on the source node (but on some other nodes).
In neither case will we allow establishing a replica on the target node; NodeReplacementAllocationDecider will prevent that. Only relocations to the target node will pass through canAllocate.
In the 1st case we would in many cases have to vacate the shard anyway, so the benefit only materializes when we are lucky and the copy on the source node is the one we remove. That is already an edge case, since the shard is not supposed to reside on the source node at all.
In the 2nd case there would be no additional benefit from keeping the replica count and no drawback from increasing it.
I suppose this is again a matter of data safety vs. how sure we are the replace succeeds. I am inclined to prefer data safety 🙂, in particular in this case, since I think the drawbacks are minor and luck-related.
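To make the alternative concrete, below is a small standalone sketch, not code from the PR, of what counting a replacement source/target pair as a single node would mean for the node count that auto-expand sees; the plain string node ids and the source-to-target map are simplified stand-ins for the real cluster metadata.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Standalone sketch (not the PR's code): if a REPLACE source and its target are
// both in the cluster, fold them into one entry so auto-expand counts the pair once.
final class AutoExpandNodeCountSketch {

    static int effectiveNodeCount(Set<String> dataNodeIds, Map<String, String> replacementSourceToTarget) {
        Set<String> counted = new HashSet<>();
        for (String nodeId : dataNodeIds) {
            // Find the source whose replacement target is this node, if any.
            String sourceOfThisTarget = null;
            for (Map.Entry<String, String> entry : replacementSourceToTarget.entrySet()) {
                if (entry.getValue().equals(nodeId)) {
                    sourceOfThisTarget = entry.getKey();
                    break;
                }
            }
            if (sourceOfThisTarget != null && dataNodeIds.contains(sourceOfThisTarget)) {
                // Key the pair by the source id so source and target count as one node.
                counted.add(sourceOfThisTarget);
            } else {
                counted.add(nodeId);
            }
        }
        return counted.size();
    }

    public static void main(String[] args) {
        // Three data nodes, with node-a being replaced by node-b: the pair counts
        // as one, so auto-expand sees two nodes instead of three.
        System.out.println(effectiveNodeCount(
            Set.of("node-a", "node-b", "node-c"),
            Map.of("node-a", "node-b"))); // prints 2
    }
}
```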
Thanks, this is looking good! Left a few comments but nothing major.
A separate problem with the current implementation is that we will not allow allocating replicas to a target node where data copies already exist. This could be a problem if the target node falls out of the cluster for a short while and then joins again. This poses the following potential issues:
We should look into allowing the replicas to be allocated back to the target node to avoid this. There are a few options to consider:
I think option 2 is the best one. It can be argued that by allowing the allocation we risk putting more data on the target node, but the amount should be similar to what we would have had if the node had stayed in the cluster, so this seems ok.
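For illustration, here is a minimal standalone sketch of the idea of letting existing copies back onto the target (roughly the direction the later follow-up change takes); it is not the PR's actual decider, and the parameters are simplified stand-ins for real cluster state.

```java
// Standalone sketch: allocation to a replacement target is normally limited to
// shards vacated from the source, but a shard whose data already exists on the
// target's disk is allowed back, e.g. after the target briefly left the cluster.
final class ReplacementTargetAllocationSketch {

    enum Decision { YES, NO }

    static Decision canAllocateToTarget(String shardCurrentNodeId,
                                        String replacementSourceNodeId,
                                        boolean targetAlreadyHasShardData) {
        if (targetAlreadyHasShardData) {
            // The copy existed on the target before it rejoined; re-allocating it adds
            // no more data than the node would have held had it stayed in the cluster.
            return Decision.YES;
        }
        if (replacementSourceNodeId.equals(shardCurrentNodeId)) {
            // Normal vacate path: shards moving off the source may land on the target.
            return Decision.YES;
        }
        // Everything else stays off the target until the replacement completes.
        return Decision.NO;
    }
}
```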
Pinging @elastic/es-core-infra (Team:Core/Infra)
@gwbrown @henningandersen okay, I'm slightly concerned that the complexity of this PR is getting a little out of hand (at least it's hard for me to follow since it's been so long-lived). If it's okay with you, I'd rather lay down the base for node replacement in this PR and then work on the other scenarios separately. As far as I know, there are two things that this PR doesn't address:
If you agree that those can be done in subsequent work, could you both take a fresh look at this PR? (I promise I won't force-push any more now that it's not a draft PR.)
@henningandersen thanks for taking another look, I think I addressed all your comments again.
LGTM, thanks for iterating on this with me. I hope @gwbrown can find the time for a review too.
internalCluster().startNode(Settings.builder().put("node.name", nodeB));
internalCluster().startNode(Settings.builder().put("node.name", nodeC));
I wonder if we can swap the order of starting the nodes? Suppose we allowed allocating to both nodes (a bug); we would then risk the shards moving over to B (they only have to start relocating) before node C has started.
We do have another test that starts nodeC prior to nodeB for the replacement (it was one of the changes you asked for), which I think satisfies this?
Yeah, I did notice that, and thanks for doing it. I was only adding this comment here because this test is named testNodeReplacementOnlyToTarget and it seems wrong to then not have another node that shards could be allocated to. But I agree that the case is covered, so you can consider this comment optional for sure.
LGTM, thanks Lee!
// Create an index and pin it to nodeA, when we replace it with nodeB,
// it'll move the data, overridding the `_name` allocation filter
Can we update this comment? It looks like a copy-paste error, since we do not pin the index here.
💔 Backport failed
You can use sqren/backport to manually backport by running
* WIP, basic implementation
* Pull `if` branch into a variable
* Remove outdated javadoc
* Remove map iteration, use target name instead of id (whoops)
* Remove streaming from isReplacementSource
* Simplify getReplacementName
* Only calculate node shutdowns if canRemain==false and forceMove==false
* Move canRebalance comment in BalancedShardsAllocator
* Rename canForceDuringVacate -> canForceAllocateDuringReplace
* Add comment to AwarenessAllocationDecider.canForceAllocateDuringReplace
* Revert changes to ClusterRebalanceAllocationDecider
* Change "no replacement" decision message in NodeReplacementAllocationDecider
* Only construct shutdown map once in isReplacementSource
* Make node shutdowns and target shutdowns available within RoutingAllocation
* Add randomization for adding the filter that is overridden in test
* Add integration test with replicas: 1
* Go nuts with the verbosity of allocation decisions
* Also check NODE_C in unit test
* Test with randomly assigned shard
* Fix test for extra verbose decision messages
* Remove canAllocate(IndexMetadat, RoutingNode, RoutingAllocation) overriding
* Spotless :|
* Implement 100% disk usage check during force-replace-allocate
* Add rudimentary documentation for "replace" shutdown type
* Use RoutingAllocation shutdown map in BalancedShardsAllocator
* Add canForceAllocateDuringReplace to AllocationDeciders & add test
* Switch from percentage to bytes in DiskThresholdDecider force check
* Enhance docs with note about rollover, creation, & shrink
* Clarify decision messages, add test for target-only allocation
* Simplify NodeReplacementAllocationDecider.replacementOngoing
* Start nodeC before nodeB in integration test
* Spotleeeessssssss! You get me every time!
* Remove outdated comment

# Conflicts:
#	x-pack/plugin/shutdown/src/test/java/org/elasticsearch/xpack/shutdown/TransportGetShutdownStatusActionTests.java
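The "Implement 100% disk usage check during force-replace-allocate" and "Switch from percentage to bytes in DiskThresholdDecider force check" items above boil down to one guard. The following is a standalone sketch of that idea only, not the actual DiskThresholdDecider code, with an illustrative method name.

```java
// Standalone sketch: even when other deciders are overridden during a REPLACE
// vacate, refuse to force-allocate a shard onto the target unless it physically fits.
final class ForceAllocateDiskCheckSketch {

    static boolean canForceAllocateDuringReplace(long shardSizeBytes, long freeBytesOnTarget) {
        // Absolute bytes rather than a percentage watermark: the hard requirement
        // during a forced vacate is simply that the shard fits on the target disk.
        return shardSizeBytes <= freeBytesOnTarget;
    }
}
```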
* master:
  Fix DataTierTests package and add a validation test (elastic#78880)
  Fix split package org.elasticsearch.common.xcontent (elastic#78831)
  Store DataTier Preference directly on IndexMetadata (elastic#78668)
  [DOCS] Fixes typo in calendar API example (elastic#78867)
  Improve Node Shutdown Observability (elastic#78727)
  Convert encrypted snapshot license object to LicensedFeature (elastic#78731)
  Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801)
  Fix incorrect generic type in PolicyStepsRegistry (elastic#78628)
  [DOCS] Fixes ML get calendars API (elastic#78808)
  Implement GET API for System Feature Upgrades (elastic#78642)
  [TEST] More MetadataStateFormat tests (elastic#78577)
  Add support for rest compatibility headers to the HLRC (elastic#78490)
  Un-ignoring tests after backporting fix (elastic#78830)
  Add node REPLACE shutdown implementation (elastic#76247)
  Wrap VersionPropertiesLoader in a BuildService to decouple build logic projects (elastic#78704)
  Adjust /_cat/templates not to request all metadata (elastic#78829)
  [DOCS] Fixes ML get scheduled events API (elastic#78809)
  Enable exit on out of memory error (elastic#71542)

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to elastic#70338 and elastic#76247
#78942) This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to #70338 and #76247
elastic#78942) This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to elastic#70338 and elastic#76247 # Conflicts: # server/src/test/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdMonitorTests.java
#78942) (#79008) This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to #70338 and #76247 # Conflicts: # server/src/test/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdMonitorTests.java
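A standalone sketch of the behaviour these follow-up commits describe, keeping the flood-stage block in place while an index still has a shard on a REPLACE-shutdown source node; this is not the actual DiskThresholdMonitor code, and the names and parameters are illustrative.

```java
import java.util.Set;

// Standalone sketch: before releasing an index's flood-stage (write) block,
// skip any index that still resides on a node being vacated by a REPLACE shutdown,
// so resumed indexing cannot fill up the target mid-relocation.
final class FloodStageReleaseSketch {

    static boolean mayReleaseFloodStageBlock(Set<String> nodesHostingIndex,
                                             Set<String> replaceShutdownSourceNodes,
                                             boolean allHostingNodesBelowHighWatermark) {
        if (allHostingNodesBelowHighWatermark == false) {
            return false; // still short on disk space
        }
        for (String nodeId : nodesHostingIndex) {
            if (replaceShutdownSourceNodes.contains(nodeId)) {
                // Keep the block until the data has moved to the replacement target.
                return false;
            }
        }
        return true;
    }
}
```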
This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to elastic#70338 as a follow-up to elastic#76247
…ement (#79171) This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to #70338 as a follow-up to #76247
…ement (elastic#79171) This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to elastic#70338 as a follow-up to elastic#76247
…ement (#79171) (#79266) This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to #70338 as a follow-up to #76247
This adds the implementation for the REPLACE shutdown type. Currently a shutdown requires a target node and moves the shutdown node's data to the target node, ignoring certain allocation deciders. Relates to #70338
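For reference, a sketch of registering a REPLACE-type shutdown via the low-level Java REST client; the endpoint and body fields follow the node shutdown API this PR builds on, while the host, node id, and node name below are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Sketch: register a REPLACE shutdown for node-a, naming node-b as the target
// that should receive its shards. Host, node id, and node name are placeholders.
public class RegisterReplaceShutdown {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/_nodes/node-a-id/shutdown");
            request.setJsonEntity("{"
                + "\"type\": \"replace\","
                + "\"reason\": \"replacing node-a with new hardware\","
                + "\"target_node_name\": \"node-b\""
                + "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```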